Topo aware coll comm v2 #12032

Merged: 2 commits into open-mpi:main on Dec 2, 2023

Conversation

wenduwan (Contributor):

This patch refactors #11864 and prevents the deadlock reported in #12016

Specifically, this patch breaks the disjointness flag determination logic into two parts:

  1. Unconditionally exchange each process's local peer count and compute the maximum, i.e. max_local_peers, in a (non-blocking) allreduce, and then
  2. In the activation callback function, check the max_local_peers value and set the disjointness flag.

As a result, this change makes the context->id field obsolete, and therefore we simply remove it. The allreduce also serves as a barrier and is always invoked, so we no longer need the previous logic that checked the parent communicator's disjointness flag; this simplifies the code.
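
For orientation, here is a minimal sketch of the two parts. This is not the actual diff: apart from max_local_peers, ompi_comm_set_disjointness_nb_complete and OMPI_COMM_IS_DISJOINT_SET (which appear in the patch), the helper comments and flag-setting steps are placeholders, and the sketch assumes local_peers counts only the other on-node processes.

/* Part 1: always scheduled during non-blocking communicator activation. */
int local_peers = 0;  /* number of OTHER ranks of the new communicator on my node (computed elsewhere) */
/* a non-blocking MAX allreduce over local_peers produces context->max_local_peers */

/* Part 2: the activation completion callback derives the disjointness flag from the result. */
static inline void ompi_comm_set_disjointness_nb_complete (ompi_comm_cid_context_t *context)
{
    if (!OMPI_COMM_IS_DISJOINT_SET(*context->newcommp)) {
        if (0 == context->max_local_peers) {
            /* no rank reported an on-node peer: mark the new communicator disjoint */
        } else {
            /* at least two ranks share a node: leave the disjoint flag unset */
        }
        /* either way, record that disjointness has been determined */
    } else {
        /* flag already determined for this communicator: only warn (see review discussion below) */
    }
}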

I ran the reproducer provided by @dalcinl:

env MPI4PY_RC_THREAD_LEVEL=multiple \
  mpiexec -n 1 \
  -x PATH \
  -x LD_LIBRARY_PATH \
  python /home/ec2-user/mpi4py/test/main.py -k TestSpawnSingleSelf
...
[[email protected]] Python 3.7.16 (/home/ec2-user/env/bin/python)
[[email protected]] numpy 1.21.6 (/home/ec2-user/env/lib64/python3.7/site-packages/numpy)
[[email protected]] MPI 3.1 (Open MPI 5.1.0)
[[email protected]] mpi4py 4.0.0.dev0 (/home/ec2-user/mpi4py/build/lib.linux-x86_64-3.7/mpi4py)
........
----------------------------------------------------------------------
Ran 8 tests in 0.954s

OK

NOTE: I observed deadlock behavior while making this change:

  • If I make two allreduce calls, i.e. one for context->id and one for context->max_local_peers, and pass them as subreqs to ompi_comm_request_schedule_append (request, ompi_comm_activate_nb_complete, &subreqs, 2), the reproducer hangs. The backtrace shows that the allreduce never completes:
#0  0x00007fb5c6081377 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fb5b7035118 in PMIx_Lookup (pdata=0x7fff36046dc0, ndata=1, info=0x7fff36046900, ninfo=2) at client/pmix_client_pub.c:235
#2  0x00007fb5b7faeac5 in opal_pmix_base_exchange (indat=0x7fff360470f0, outdat=0x7fff36046dc0, timeout=600) at base/pmix_base_fns.c:75
#3  0x00007fb5b8283665 in ompi_comm_allreduce_pmix_reduce_complete (request=0x2db8d58) at communicator/comm_cid.c:1294
#4  0x00007fb5b8285e8b in ompi_comm_request_progress () at communicator/comm_request.c:154
#5  0x00007fb5b7ef2ea6 in opal_progress () at runtime/opal_progress.c:224
#6  0x00007fb5b7f3ab59 in ompi_sync_wait_mt (sync=0x7fff360473f0) at base/wait_sync.c:121
#7  0x00007fb5b827fafc in ompi_request_wait_completion (req=0x2db9618) at ../ompi/request/request.h:469
#8  0x00007fb5b8282732 in ompi_comm_activate (newcomm=0x7fff36047878, comm=0x7fb5b89a4c40 <ompi_mpi_comm_self>, bridgecomm=0x0, arg0=0x7fff360474f4, arg1=0x7fff360485d0, send_first=false,
    mode=256) at communicator/comm_cid.c:954
#9  0x00007fb5b829266a in ompi_dpm_connect_accept (comm=0x7fb5b89a4c40 <ompi_mpi_comm_self>, root=0, port_string=0x7fff360485d0 "987824129.0:1405746414", send_first=false,
    newcomm=0x7fff360489d0) at dpm/dpm.c:551
#10 0x00007fb5b82f2494 in PMPI_Comm_spawn (command=0x7fb59241d950 "/home/ec2-user/env/bin/python", argv=0x7fb598a4ff10, maxprocs=1, info=0x7fb5b89ae220 <ompi_mpi_info_null>, root=0,
    comm=0x7fb5b89a4c40 <ompi_mpi_comm_self>, intercomm=0x7fb592275250, array_of_errcodes=0x0) at comm_spawn.c:157
#11 0x00007fb5b8bb77d7 in __pyx_pf_6mpi4py_3MPI_9Intracomm_36Spawn (__pyx_v_self=0x7fb5a62d24e0, __pyx_v_info=0x7fb5a62d2360, __pyx_v_errcodes=0x7fb5c6a16c50 <_Py_NoneStruct>,
    __pyx_v_root=0, __pyx_v_maxprocs=1, __pyx_v_args=<optimized out>, __pyx_v_command=<optimized out>) at src/mpi4py/MPI.c:198402
#12 __pyx_pw_6mpi4py_3MPI_9Intracomm_37Spawn (__pyx_v_self=0x7fb5a62d24e0, __pyx_args=<optimized out>, __pyx_nargs=<optimized out>, __pyx_kwds=<optimized out>) at src/mpi4py/MPI.c:1589
#13 0x00007fb5b8ac7025 in __Pyx_PyVectorcall_FastCallDict_kw (kw=0x7fb59b187820, nargs=4, args=<optimized out>, vc=0x7fb5b8ac86b0 <__Pyx_CyFunction_Vectorcall_FASTCALL_KEYWORDS>,
    func=0x7fb5a62e9d70) at src/mpi4py/MPI.c:297465
#14 __Pyx_PyVectorcall_FastCallDict (kw=0x7fb59b187820, nargs=4, args=<optimized out>, vc=0x7fb5b8ac86b0 <__Pyx_CyFunction_Vectorcall_FASTCALL_KEYWORDS>, func=0x7fb5a62e9d70)
    at src/mpi4py/MPI.c:35334
#15 __Pyx_CyFunction_CallAsMethod (func=0x7fb5a62e9d70, args=<optimized out>, kw=0x7fb59b187820) at src/mpi4py/MPI.c:36021

The allreduce used in this case was ompi_comm_allreduce_intra_pmix_nb.
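
To make the difference concrete, here is a rough sketch of the two scheduling patterns. These are illustrative fragments, not the actual code; only ompi_comm_request_schedule_append, ompi_comm_activate_nb_complete and the context fields come from the discussion above, and the allreduce setup itself is elided.

int rc;

/* Pattern that hung: two independent non-blocking allreduces, both handed to the
 * comm request machinery as subrequests. With the PMIx-based allreduce the second
 * exchange never completes (see the backtrace above). */
ompi_request_t *subreqs[2];
/* subreqs[0] <- non-blocking allreduce for context->id              */
/* subreqs[1] <- non-blocking allreduce for context->max_local_peers */
rc = ompi_comm_request_schedule_append (request, ompi_comm_activate_nb_complete, subreqs, 2);

/* Pattern described in this PR: a single non-blocking MAX allreduce that carries
 * max_local_peers (and doubles as the barrier), with the disjointness flag set
 * in the completion callback. */
ompi_request_t *subreq;
/* subreq <- non-blocking MAX allreduce over the local peer count */
rc = ompi_comm_request_schedule_append (request, ompi_comm_activate_nb_complete, &subreq, 1);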

Inline review on ompi/communicator/comm_cid.c:

static int ompi_comm_activate_complete (ompi_communicator_t **newcomm, ompi_communicator_t *comm)
/* Callback function to set communicator disjointness flags */
static inline void ompi_comm_set_disjointness_nb_complete (ompi_comm_cid_context_t *context) {
if (!OMPI_COMM_IS_DISJOINT_SET(*context->newcommp)) {

Member:
This function should only be called once per communicator. Calling a second time should trigger an error, at least a message in debug mode.

wenduwan (Contributor Author):
Thanks, I will think about that.

wenduwan (Contributor Author):
Please see the update: I decided to add a warning message instead of a hard failure.

Inline review on ompi/communicator/comm_cid.c:

* send messages over the new communicator
/**
* Dual-purpose barrier:
* 1. The communicator's disjointness is inferred fromimax_local_peers.

Member:
typo.

wenduwan (Contributor Author):
Thanks, fixed.

wenduwan self-assigned this on Nov 5, 2023

wenduwan (Contributor Author):
@bosilca Heartbeat: I imagine you are busy at SC23. Please kindly review the change when you have time.

bosilca self-requested a review on December 2, 2023 05:14

bosilca (Member) left a comment:
Please stash my comment into one of yours. Also, you need to sign your commits.

Commit 1:

This patch introduces a new communicator flag to indicate that no
processes share the same node.

This flag is intended to let hierarchy-aware collective operations
select a more efficient transport/algorithm.

In this change we introduce a non-blocking allreduce step in communicator
activation and set the new flag in the completion callback function.

Signed-off-by: Wenduo Wang <[email protected]>
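
For context on how the new flag is meant to be consumed, a hypothetical check in a hierarchy-aware collective component might look like the sketch below. OMPI_COMM_IS_DISJOINT is assumed by analogy with OMPI_COMM_IS_DISJOINT_SET shown earlier and may not match the final macro name; the two algorithm helpers are placeholders.

static int my_coll_select_algorithm (struct ompi_communicator_t *comm)
{
    /* Only trust the disjointness flag once it has actually been determined. */
    if (OMPI_COMM_IS_DISJOINT_SET(comm) && OMPI_COMM_IS_DISJOINT(comm)) {
        /* No two ranks share a node: a node-local (shared-memory) stage cannot
         * help, so go straight to the inter-node algorithm. */
        return use_flat_internode_algorithm (comm);  /* placeholder */
    }
    /* Otherwise keep the node-aware, hierarchical path. */
    return use_hierarchical_algorithm (comm);        /* placeholder */
}
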
Commit 2:

This patch introduces assertions to verify that sub-communicators
are created with the expected OMPI_COMM_DISJOINT* flags.

Signed-off-by: Wenduo Wang <[email protected]>

wenduwan (Contributor Author) commented on Dec 2, 2023:

@bosilca Thank you very much! I have fixed the typo and rebased the branch.

bosilca merged commit 76c62fd into open-mpi:main on Dec 2, 2023
10 checks passed